Support for Qwen-Image, Qwen-Image-Edit, and Qwen-Image-Edit-Plus #2072
Conversation
Basic support for Qwen-Image-Edit done.
I fixed in the code the settings that work for my GPU:

```python
model = w.nunchaku_load_qwen_diffusion_model(
    model_info.filename,
    cpu_offload="enable",
    num_blocks_on_gpu=16,
    use_pin_memory="disable",
)
```
I got an error here if I don't override the checkpoint resolution in the style. The resolution range could probably be flexible like Flux, although I notice the Edit model doesn't follow instructions properly when the resolution isn't ~1MP. Apart from that, it works well!
Is there any downside to using the Plus version?
You just create some "Reference" control layers for more image sources. The UI side of this already works and multiple image inputs are handled in
It's still unclear to me if this is really a bug, or an issue with input resolutions. Or what exactly the difference is between passing the image to the encode node vs. using ReferenceLatent...
I don't think this should be the burden of the user, unless as a last resort. It's a bit annoying that the nunchaku nodes don't figure this out themselves; they have the most information. For the Flux loader there was at least a heuristic.
You're right, I always had a custom resolution in my styles. I'll add a valid resolution range.
Sadly the Plus version is pretty bad at transforming styles (i.e. to pixel art, drawing, etc.). It is better at everything else though. And the 1MP transform seems to be handled directly by the Plus text encode node. Right now I'm in the middle of adding a new architecture.
I just saw that when copying the flux code, I'll use that for
I think it's a little bit of both, but there is no real downside to using it. Concerning Qwen-Image-Edit-Plus, I find it even more prone to zooming out.
We always have the possibility to simply detect the amount of VRAM and, below 16GB, enable CPU offload. The num_blocks_on_gpu can be left at 1; the performance stays the same, the only change is that you keep some of your system RAM available since more of the model is left on the GPU.
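A minimal sketch of such a heuristic, assuming a hypothetical helper that takes the detected VRAM in GB and returns keyword arguments for the Nunchaku loader (the function name and 16GB threshold are mine; the setting names mirror the loader call shown earlier):

```python
def nunchaku_offload_settings(vram_gb: float) -> dict:
    """Hypothetical heuristic: pick Nunchaku offload settings from available VRAM."""
    if vram_gb < 16:
        # Low-VRAM cards: offload transformer blocks to system RAM.
        return {
            "cpu_offload": "enable",
            "num_blocks_on_gpu": 1,
            "use_pin_memory": "disable",
        }
    # Enough VRAM: keep the whole model on the GPU.
    return {"cpu_offload": "disable"}
```

The VRAM amount itself could come from e.g. `torch.cuda.get_device_properties(0).total_memory`.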
```python
def from_string(s: str, filename: str | None = None):
    if s == "svdq":
        return Quantization.svdq
    elif filename and "qwen" in filename and "svdq" in filename:
```
Why was this needed? Was there a svdq file that wasn't detected?
I added some logs to find what the quantization value was, and nothing is returned in the model entity! I had to resort to the filename trick...
I'll try some more debugging.
Ah you're right, the quant field wasn't set by the model detection.
Fixed here: Acly/comfyui-tooling-nodes@f555efb
Yea, I think Arch should really be for different architectures, or at least "model ecosystems". I'm actually not even sure Flux Kontext and Qwen Edit deserve their own Arch (edit models are technically just finetunes), but there are lots of things that don't work for edit models, so it makes some sense.
|
This is fabulous work. I have been using Qwen on ComfyUI for the first time, waiting for it to hit Krita Diffusion. The only thing I would say is to make sure that the (locked) new Qwen default doesn't have a Nunchaku dependency. After a recent update, I could no longer get the default locked Kontext setting to work, as I didn't have Nunchaku installed and I just got an error. The problem there was that in the 'default' Kontext, 'Diffusion Architecture' was set to 'Automatic', and of course that cannot be changed by the user. 'Automatic' turned out to have Nunchaku requirements(!). So I just copied the locked default and changed the Diffusion Architecture to 'Flux Kontext'.
The PR has been tested with GGUF and Nunchaku versions. I do not have a non-quantized Qwen version, nor a ComfyUI installation without the nunchaku package. Feel free to give it a try; theoretically there should be no problem without it.
|
Here is the code of the internal resize in ComfyUI's Qwen text encode node:

```python
samples = image.movedim(-1, 1)
total = int(384 * 384)
scale_by = math.sqrt(total / (samples.shape[3] * samples.shape[2]))
width = round(samples.shape[3] * scale_by)
height = round(samples.shape[2] * scale_by)
s = comfy.utils.common_upscale(samples, width, height, "area", "disabled")
images_vl.append(s.movedim(1, -1))
if vae is not None:
    total = int(1024 * 1024)
    scale_by = math.sqrt(total / (samples.shape[3] * samples.shape[2]))
    width = round(samples.shape[3] * scale_by / 8.0) * 8
    height = round(samples.shape[2] * scale_by / 8.0) * 8
    s = comfy.utils.common_upscale(samples, width, height, "area", "disabled")
    ref_latents.append(vae.encode(s.movedim(1, -1)[:, :, :, :3]))
```

It looks like a simple fixed-ratio resize to 1048576 total pixels (i.e. 1024x1024), then rounded to a multiple of 8. There is some strange stuff with
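The resize math above can be condensed into a small standalone helper to check what dimensions the node produces (the function name is mine, not from ComfyUI):

```python
import math

def fixed_megapixel_size(width: int, height: int,
                         total: int = 1024 * 1024, multiple: int = 8) -> tuple[int, int]:
    """Scale (width, height) to ~`total` pixels, keeping the aspect ratio,
    with each side rounded to a multiple of `multiple`."""
    scale_by = math.sqrt(total / (width * height))
    return (round(width * scale_by / multiple) * multiple,
            round(height * scale_by / multiple) * multiple)
```

For instance, `fixed_megapixel_size(1856, 1440)` lands within a fraction of a percent of one megapixel while keeping the aspect ratio.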
Yea. It resizes all images to 384*384 total pixels while keeping the aspect ratio and attaches that to the LLM prompt. If a VAE is provided it also resizes all images to 1024*1024 total pixels while keeping the aspect ratio and attaches them to conditioning as if using the Reference Latent node.

The question is whether the automatic resize is good or bad. For Flux Kontext I omitted it. Although it follows prompts better when images stay around 1MP, it can really degrade quality. Probably it's similar here. If I understood you correctly, the internal resize makes the "zoom in" behavior more likely. In any case we should pass the images, and don't need to pass the VAE, since we already handle ReferenceLatent anyway.
The problem is that I was thinking of something more along the lines of resizing the expected latent output to the same size as the first internal resize, to see if it helps with the drifting. It's too bad the internal scale cannot be avoided. Maybe I should try some tests with a modified copy of the node...

[edit]: Note: I did not try to chain multiple

[edit2]: I inspected the code of
Yes! By completely ignoring the resize we got a pixel-perfect edit, even when using 1856x1440 and 1536x1536 reference images. The performance was pretty bad because of the size, but the output is really good. It seems to work even better if the reference latents are resized to have the same short-side dimension. I'll update the code.
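A sketch of the short-side matching idea: rescale each reference size so its shorter side equals a common target while keeping each image's aspect ratio (the helper name is hypothetical, not from the PR):

```python
def match_short_side(sizes: list[tuple[int, int]], target_short: int) -> list[tuple[int, int]]:
    """Rescale each (width, height) so its short side equals target_short,
    preserving the aspect ratio of each image."""
    result = []
    for w, h in sizes:
        scale = target_short / min(w, h)
        result.append((round(w * scale), round(h * scale)))
    return result
```

With the sizes mentioned above, matching on the smaller short side (1440) leaves the first image untouched and shrinks the square one to 1440x1440.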
|
Using
Yea, that's a bug, the buttons are always linked to the image model regions. I can fix it later.
|
Question: do we really need to support both edit model variants? From what I understood, 2509 is an evolution and may be further improved. Even if it doesn't do everything 100% better, I'd rather not support the old version if it's almost obsolete already, and it will probably be more so in the future.
|
Another question: which models support LoRA? Do any of the Nunchaku quants support them? I wonder if we need to disable the LoRA UI, or at least make sure there's an error of some kind explaining that LoRAs can't be used. Not sure what the timeline is for Nunchaku Qwen LoRA support; it would be really nice to have (also to only have one model and a Lightning sampler which uses the LoRA).
I'm with you on this one, I really dislike having to use many models where a few LoRAs would be enough. The Nunchaku code to support LoRA loading is supposedly ready, cf. this comment: nunchaku-ai/ComfyUI-nunchaku#479 (comment). I don't know how long it will take for it to be available in ComfyUI.
Sadly the 2509 variant is seriously lacking in the styling department. It's a joke how bad it is, not even usable for this purpose. That's why I decided to keep both the recent and older architectures for now; otherwise it would amputate a pretty nice feature. As soon as they create a model capable of both, we can drop the older code! I'm pretty sure they won't go back to the simple

I'd rather keep backward compatibility for now, as long as it does not increase the technical debt too much.
|
I just saw on another computer that the Qwen SVG icons require a local font to be displayed correctly; I wrongly thought Inkscape would vectorize the text. 😅 I'll let you create better ones, those were only placeholders anyway.
|
I made a small follow-up here: #2076. Have a look if you want. I moved the qwen-text-encode a bit further, to the same level as the regular clip-text-encode; hope I didn't break something. Thanks for your work!
|
@shadowlocked I think your problem has nothing to do with Nunchaku. Chroma doesn't have a Nunchaku version at all. The built-in Flux/Qwen styles are set up to look for a number of potential models, including Nunchaku and non-Nunchaku variants, and take whatever they find. Maybe it's just that you put the model in the wrong folder? It should be in










Basic support for Qwen-Image models, including Nunchaku SVDQ quantized versions.
I did not touch the models and node auto-installation part, as I'm not familiar with it. For now you can try this PR as long as you have a Qwen-Image model downloaded (normal, GGUF, or SVDQ) and recent ComfyUI and ComfyUI-nunchaku versions.
You can also load a Lightning LoRA from https://huggingface.co/lightx2v/Qwen-Image-Lightning .
When using Lightning versions I recommend creating specific presets with `minimum_steps` set to 1.

NB: I'll try to add Qwen-Image-Edit support, and maybe Qwen-Image-Edit-Plus (i.e. 2509) basic support with only one layer. For this latter one it will heavily limit the possibilities offered by the model, but I'd rather have a separate PR for the UI modifications needed to handle multiple edit sources.

- `Arch.qwen`, `Arch.qwen_e`, `Arch.qwen_e_p`
- `TextEncodeQwenImageEdit` + `ReferenceLatent`
- `TextEncodeQwenImageEditPlus` + chained `ReferenceLatent` nodes
- Bug on UI when using Qwen-Image and linked Edit model (will be fixed later)